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Introduction 

This  paper  suggests  two  techniques  for  reducing  the  costs  of 
operating  certain  database  applications  that  use  a  hierarchical 
structure  with  inversion  on  low-level  fields.  These  techniques  are: 

1)  introducing  "family  redundancy,"  i.e.,  introducing  redundant 
information  into  a  record  that  was  obtained  directly,  or 
through  a  summarization  process,  fran  its  "parent,"  "child,"  or 
"sibling"  records,  without  altering  the  structure  of  the 
hierarchy,  or  eliminating  fields  from  any  records. 

2)  incorporating  one  record  level  into  another. 

The  paper  contains  formulas  which  make  explicit  the  costs  and  savings 
associated  with  each  technique.  A  case  study  will  be  presented  in  which 
the  formulas  are  used  to  predict  possible  savings.  The  observed 
performance  improvements,  following  the  implementation  of  these 
techniques,  are  reported.  These  improvements  corresponded  to  those 
predicted  by  the  formulas* 

The  case  study  chosen  is  a  birth  defects  diagnostic  assist  system. 
For  this  syston,  these  techniques  reduced  the  average  costs  by 
approximately  eighty  per  cent. 
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Background  Concepts  in  Hierarchical  Data  Bases 

The  hierarchical  model  of  data  is  extensively  used  in  a  broad  number 
of  applications*.  For  certain  applications,  those  characterized  by  a 
relatively  stable  database  with  pre-defined  queries  or  operations,  we 
have  found  the  hierarchical  system  to  be  quite  appropriate  [4] . 

A  key  goal  of  a  systen  designer,  in  general,  is  to  reduce  data 
redundancy,  since  redundancy  wastes  space  and  creates  the  potential  for 
inconsistency  in  the  data.  However,  redundancy  can  also  yield 
significant  improvement  in  access  times,  and  this  savings  must  be 
weighed  against  the  corresponding  costs**.  In  this  paper,  two  general 
redundancy-based  database  design  techniques  are  introduced.  In  order  to 
illustrate  these  techniques,  consider  the  following  hypothetical 
database: 

Figure  lA  depicts  a  hierarchical  structure  of  three  record  types: 

*  STUDENT  -  containing  ID  number,  name  and  address. 

*  YEAR  -  containing  year  and  the  department  in  which  the  student 

was  registered  during  that  year. 

*  COURSE  -  containing  course  number  and  the  grade  received  by  the 

student. 

Figure  IB  depicts  one  instance  of  that  hierarchy  in  which  the  student 
is  Steven  Melnick,  and  his  address  is  Eastgate.  He  has  attended  school 


*  IMS,   a  hierarchical  data  base  management  system,   is  reported  as 
having  the  second  most  number  of  installations  in  the  world  [71 . 

**  See  Severance  for  a  general  model  of  file  costs  [9]. 
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for  two  years.  In  1978,  he  was  in  Department  6,  and  in  1979,  he  switched 
to  Department  15.  He  took  two  courses  during  1978,  and  one  in  1979. 

This  database  organization  is  particularly  well-suited  for  responding 
to  queries  such  as:  "Print  all  grades  for  Melnick."  Assuming  there  is 
an  index  on  the  name  field,  Melnick 's  STUDENT  record  could  be  accessed 
directly,  and  all  of  the  YEAR  and  COURSE  records  read  sequentially. 
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STUDENT 
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Figure  lA:   Sample  Hierarchy  (*  =  inverted  field) 
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Figure  IB:  Instance  of  sample  hierarchy 
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Improving  Performance  with  Family  Redundancy 

Let  us  examine  the  steps  involved  in  answering  the  following,  moi(> 
difficult  query:  "List  all  students  who  received  a  grade  point  average 
(GPA)  greater  than  3.4  in  1979." 

Assuming  that  the  YEAR  field  is  indexed,  one  access  would  be  required 
to  retrieve  the  1979  YEAR  record  for  the  first  student  enrolled  that 
year;  then  one  access  would  be  required  for  each  of  the  COURSE  records 
underneath  that  YEAR  record.  Using  the  grades  retrieved,  one  student's 
GPA  for  1979  would  then  be  computed.  If  the  average  exceeded  3.4,  an 
access  would  be  made  upward  from  the  YEAR  record  to  the  parent  STUDENT 
record  to  get  the  name  of  the  student.  These  steps  would  be  repeated 
for  every  student  who  was  enrolled  in  1979.  Thus,  it  would  be  necessary 
to  access  the  middle  of  the  hierarchy,  go  downward,  and  if  the  query 
condition  was  met,  return  to  the  root  record. 

In  order  to  avoid  accessing  all  of  the  1979  COURSE  records  for  each 
student,  the  grade  point  average  could  be  computed  and  placed  in  the 
YEAR  record,  as  illustrated  in  Figures  2A  and  2B.  Under  this  schone,  a 
maximum  of  two  accesses  per  student  are  required  to  answer  the  query, 
and  the  COURSE  records  need  never  be  accessed.  Thus,  all  downward 
accesses  have  been  eliminated,  and  it  is  now  only  necessary  to  access 
the  middle  of  the  hierarchy,  and  if  the  query  was  met,  return  to  the 
root  record. 

However,  redundancy  has  been  introduced,  resulting  in  extra  storage 
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and  maintenance.  For  example,  if  Melnick's  grade  in  course  15.565  is 
changed  to  an  'A',  then  both  the  COURSE  record  and  the  YEAR  record  must 
be  updated.  In  the  structure  of  Figure  lA,  on  the  other  hand,  only  the 
COURSE  record  need  be  updated. 


STUDENT 


I  III 

I   ID  NUMBER*  I   NAME  I   ADDRESS   I 

I             III 
+ + 
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YEAR 


I 


COURSE 


I  I  I 

I   YEAR*    I    DEPT    |   GPA 

I          I           I 
H 


+ + 

I  I  I 

I     COURSE*    I   GRADE        I 


Figure  2A:  Placing  GPA  in  YEAR 
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I  123-456   I  Steven  Melnick    I    Eastgate 


-+     +- 


1978 


3.5  I     I  1979  I   15   I  4.0 


-+ 
I 


-+     ^ — 


15.564  I  A  I   I  15.565  I  B  1      I   15.571  I  A 


+     + — 

Figure  2B:  Instance 
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In  the  previous  example,  accesses  were  reduced  by  bringing  redundant 
information  upward.  Accesses  can  also  be  reduced  by  introducing 
redundant  information  within  a  single  level.  For  example,  consider  the 
query:  "List  all  people  who  received  an  'A'  in  15.565  and  have  a  GPA 
exceeding  3.4."  If  the  structure  in  Figure  lA  is  used,  it  would  be 
necessary  to  perform  the  lengthy  process  of  computing  the  GPA  for  every 
student  who  received  an  'A'  in  15.565.  If  the  structure  in  Figure  2A  is 
used,  the  number  of  accesses  is  reduced  to  two  per  each  15.565  COURSE 
record.  However,  the  accesses  can  be  reduced  even  further,  to  one  per 
15.565  COURSE  record,  by  repeating  the  year's  GPA  in  each  of  the  COURSE 
records  for  that  year,  as  depicted  in  Figures  3A  and  3B.  Although  this 
seons  inordinately  redundant,  it  may  result  in  a  more  efficient  system, 
depending  on  the  frequency  of  such  queries. 

These  two  examples  of  database  restructuring  illustrate  the  general 
concept  of  "family  redundancy,"  a  technique  whereby,  "family" 
information  obtained  fran  related  records  (parents,  siblings,  children, 
etc)  is  introduced  into  a  record  in  order  to  improve  the  performance  of 
certain  queries.  Note  that  there  is  no  change  in  capabilities,  i.e., 
any  query  that  could  have  been  answered  before  restructuring  can  be 
answered  at  least  as  efficiently  after  restructuring.  In  other  words, 
"family  redundancy"  will  not  increase  {and  may  reduce)  the  number  of 
accesses  needed  to  respond  to  any  query  or  operation.  This  is  true 
because  "family  redundancy"  neither  alters  the  basic  arrangement  of  the 
record  types,  nor  deletes  any  of  the  original  fields,  implying  that  any 
query  could  always  be  answered  by  ignoring  the  newly-inserted 
information  and  instead,  following  the  original  procedure.   Hence,  the 
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introduction  of  family  information  can  never  increase  the  number  of 
accesses,  and  as  was  shown  in  the  previous  examples,  can  sometimes 
reduce  them. 

Note  that  two  assumptions  have  been  made  in  the  above  example.  The 
first  is  that  the  database  is  accessed  using  a  high-level  language 
interface,  where  each  database  movement  is  pre-defined  by  the 
programmer.  This  can  be  significantly  different  fron  the  query  processor 
situation  where  the  exact  sequence  of  database  operations  is  unknown  to 
the  applications  programmer. 

The  second  assumption  is  that  there  is  no  use  of  virtual  data  [5]. 
That  is  to  say,  the  DBMS  cannot  store  in  the  GPA  field  an  "equation" 
based  on  the  GRAEE  field. 
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STUDENT 


H + 

I  III 

I   ID  NUMBER*  I   NAME  |   ADDRESS   1 


+- 
YEAR 


H + 

I  I  I  I 

I   YEAR*    I    DEPT    I   GPA    I 


+- 

COURSE 


+ + 

I  I  I  I 

I   COURSE*   I   GRADE   I    GPA    I 

I  I  I  I 

Figure  3A:     Repeating  GPA  in  YEAR 


I  123-456       I     Steven  Melnick         I         Eastgate       I 

II  II 
+ + 

I                                                         I 
+ +  + + 

II  I  I  I  I  II 

I    1978    1  6  !    3.5    I  I    1979    I        15        I    4.0    I 


I  I 


I    15.564    I   A   I    3.5    I    I    15.565    1    3    I    3.5    I    1    15.571    I    A    I    4.0    I 

I                    I         I             I    I                    I         I             I     I                    III 
^ +  -(. +  + + 

Figure  3B:  Instance 
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Quantification  of  Savings  Potential  Using  Family  Redundancy 

There  is  a  query  speed  processing  advantage  and  countervailing  space 
and  maintenance  costs  associated  with  family  redundancy: 

1)  The  savings  due  to  faster  queries  can  be  canputed  as  follows: 

Ri     =  difference  in  cost  per  query  i,  before  and  after 

database  restructuring  (function  of  CPU,  I/O, 
DBMS) 

Ni     =  average  number  of  queries  of  type  i  during  time 
period  t 

Savings  =  51  RiNi  (1) 

2)  The  costs  of  the  extra  space  are: 

b    =  cost  per  byte  per  time  period  t  ($A>yte) 

B    =  number  of  extra  bytes 

Cost  =  (b  *  B)  (2) 

3)  The  maintenance  costs  associated  with  updating  the  extra  records 

are: 

Uj   =  cost  of  updating  extra  records  for  data  update  type  j 
Vj   =  number  of  data  updates  of  type  j  per  tLme  period  t 
Cost  =  ?^U3Vj  (3) 

Therefore,  the  overall  savings  per  time  period  t  are: 

Savings(t)  =  (  H  RiNi)  -  (b*B)  -  (  i;  UjVj)  (4) 

In  a  later  section  of  this  paper,  Equation  4  shall  be  used  to  predict 
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savings  in  an  actual  system. 
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Improving  Performance  with  Collapsing  of  Levels 

If  the  structure  of  Figure  lA  was  mainly  used  to  answer  the  query: 
"Print  all  the  grades  for  every  student,"  then  the  total  number  of 
records  accessed  would  be  the  sum  of  the  number  of  STUDENT,  YEAR  and 
COURSE  records  in  the  database.  If  YEAR  is  collapsed  into  COURSE  as 
depicted  in  Figure  4A,  then  the  number  of  accesses  is  only  the  sum  of 
the  number  of  STUDENT  and  COURSE  records. 

This  example  illustrates  the  concept  of  "collapsing  of  levels," 
whereby,  levels  of  the  hierarchy  are  eliminated  and  the  information  that 
they  contained  is  duplicated  in  their  children.  Contrary  to  family 
redundancy,  collapsing  of  levels  does  not  always  give  beneficial 
results,  and  in  fact,  it  can  give  significantly  worse  results.  For 
example,  the  query,  "What  is  Melnick's  GPA  in  1979?"  results  in  having 
to  access  all  COURSE  records  in  1978,  as  well  as  those  of  1979.  Before 
collapsing,  one  access  could  be  made  to  the  1979  YEAR  record,  followed 
by  one  to  each  of  its  COURSE  records. 

Since  the  number  of  accesses  saved  per  report  can  be  negative,  the 
decision  whether  or  not  to  collapse  levels,  is  not  always  clear-cut. 
Generally,  a  level  can  be  eliminated  if  that  level  has  a  small  number  of 
bytes,  and  does  not  contain  information  that  is  often  used  for  decision- 
making in  the  hierarchy.  If  the  sample  hierarchy  contained  only  the 
year  in  the  YEAR  record,  and  if  it  was  rarely  required  to  pick  out  a 
specific  year  of  a  student's  file  (as  might  be  the  case  with  a  history 
file) ,  then  there  might  be  a  positive  benefit  in  collapsing  the  YEAR 
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record  into  the  COURSE  record.  (The  inclusion  of  the  GPA  is  an  entirely 
separate  matter.) 

STUDENT 


I   ID  NUMBER*  I   NAME  I  ADDRESS   | 


COURSE 


I   COURSE*    I   GRADE   I  YEAR   I  DEFT    I 
I  I  I        I         I 

Figure  4A:  Collapsing  of  levels 


123-456  I  Steven  Melnick  I  Eastgate 


H + 

I  III  I 

I    15.564    I   A   I    1978    I  6  j 


I 


I 


+ 

I  III 

1    15.565   I   B   !    1978    I  6 


I    15.571    I    A    I    1979    I        15 
I  III 


-+ 

I 

I 

I 
-+ 


Figure  4B:   Instance 
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Quantification  of  Savings  Potential  Using  Collapsing  of  Levels 

The  savings  function  for  collapsing  of  levels  is  identical  to 
Equation  4  with  the  following  exceptions: 

1)  The  savings  due  to  faster  queries  are  the  same  as  before,  but 
Ri  can  now  be  negative. 

2)  Collapsing  does  not  necessarily  involve  extra  storage.  The 
nimber  of  extra  bytes,  B,  is  determined  as  follows: 

Extra  Storage  =    { (C*D)  -  [P(I>+D')]  -  Q}  (5) 

where  C  =  number  of  'child'  records 
P  =  number  of  'parent'  records 
D  =  number  of  data  bytes  per  'parent'  record 
D'  =  number  of  index  bytes  per  'parent'  record 
Q  =  fixed  number  of  bytes  required  to  maintain  a  level 
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Case  Study 

To  illustrate  the  potential  for  dramatic  savings  in  an  actual 
application,  the  Birth  Defects  Information  System  (BDIS) ,  a  joint 
research  project  of  the  March  of  Dimes  Birth  Defects  Foundation  and  the 
Tufts  New  England  Medical  Center,  will  be  examined.  BDIS  provides 
various  computer  services  to  aid  physicians  in  treating  patients  with 
birth  defects.  It  is  unplemented  on  an  IBM/370  under  VM/CMS  and  uses  a 
database  management  system  called  FXXUS.  This  case  study  centers  on  the 
Diagnostic  Assist  Facility,  one  service  of  BDIS,  which  is  still  in  the 
testing  stage. 

The  Diagnostic  Assist  facilty  works  as  follows: 

1)  Physicians  choose  from  a  report  those  symptoms  that  describe 
their  patient's  condition,  e.g.,  cleft  palate,  mental 
retardation.  (SELECTION  PHASE) 

2)  Birth  defects  which  exhibit  any  of  the  chosen  symptoms  are 
retrieved  from  the  database  for  consideration.  (RETRIEVAL 
PHASE) 

3)  A  score  is  computed  for  each  of  the  considered  birth  defects, 
based  on  the  strength  of  the  match  between  the  symptoms 
selected  by  the  physician  and  those  stored  in  the  database  for 
the  birth  defect.  The  birth  defects  are  then  ranked  according 
to  the  score,  and  the  top  ten  are  maintained.  (SCORING  PHASE) 

4)  The  physician  is  asked  questions  relating  to  the  top  birth 
defects  in  order  to  more  carefully  refine  the  diagnosis.  The 
answers  are  used  to  re-score  the  birth  defects.    (CHALLENGE 
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PHASE) 

For  example,  if  the  physician  were  to  enter  the  following  symptoms: 

disproportionate  short  stature 

rhizomelia 

cleft  palate 

the  computer  might  respond  with  the  following  list  of  potential 
syndrcxnes: 

Kneist  Dysplasia 

Hermann-Opitz  Arthrogryposis  Syndrome 

Short  Rib-Polydactyly  Syndrome 

The  major  storage  structure  is  a  hierarchical  database  which  stores 
the  symptoms  that  are  commonly  found  with  each  birth  defect  syndrome. 
(See  Figure  5.)  Symptoms  which  are  always  seen  in  patients  with  that 
birth  defect  are  considered  REQUIRED,  others  are  called  non-required  or 
choice  symptoms.  Closely-related  symptans  for  a  birth  defect  are  grouped 
into  SETS.  Associated  with  each  set  is  a  WEIGHT  and  a  CHOICE  COUNT.  The 
WEIGHT  is  the  number  of  points  given  to  the  syndrome  if  the  criteria  for 
"satisfying"  the  set  are  met.  To  satisfy  a  set,  the  patient's  case  must 
match  on  all  of  the  REQUIRED  symptoms  in  the  set,  plus  some  nunaber  of 
the  choice  symptoms,  as  indicated  by  the  CHOICE  COUNT.  The  total  score 
for  a  syndrome  is  the  sum  of  the  weights  of  the  satisfied  sets, 
normalized  by  the  total  possible  sum  for  that  syndrome.  Figure  6 
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illustrates  an  instance  of  the  database  in  table  format. 


SYNDROME 

+ + 

I  I  I 

I        NUMBER*      I        NAME         | 

I                                I  I 

^ + 

1 
SET  I 

I 

III  I 

I    SET  NUMBER   I   WEIGHT    i    CHOICE  COUNT    I 

I  ~  I  I  "  I 

H + 

I 

SYMPTOM  I 

I 
+ + 

I  I  I 

I    SYMPTOM*    I    REQUIRED    I 

1          I  I 

+ + 


NUMBER  OF 
OCCURRENCES 


700 


2000 


13000 


Figure  5:  The  BDIS  Hierarchy 
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Number:  123     Name:  XYZ 
SET  WEIGHT  CHOICE_COUNT 
1*    4         2 


REQUIRED  SYMPTOM 

eye,  downward  slanting 
eyeball,  deepset 
eyeball,  atrophy  of 
eye,  ocular  hyper tol ism 


2** 


immobile  facial  expression 

mouth,  small 

lips,  puckered 

mouth,  downturned  corners 


Figure  6:  Instance  of  the  BDIS  Hierarchy 


*  In  Set  1  of  the  illustrated  syndrome,  two  of  the  four  symptoms  must  be 
matched  in  order  to  receive  four  points.  One  symptom  does  not  result  in 
any  points,  while  three  or  four  matches  still  give  only  four  points. 

**  In  Set  2,  the  patient  must  have  an  immobile  facial  expression,  and 
one  of  the  other  three  symptoms  to  receive  three  points.  If  the  first 
symptom  is  not  matched,  then  no  points  will  be  awarded  (even  if  all  the 
other  symptoms  are  matched) . 
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The  greatest  usage  of  computer  time  in  the  3DIS  system  is  during  the 

RETRIEVAL  PHASE,  when  the  birth  defect  information  is  retrieved  from  the 

database.  Following  is  a  detailed  explanation  of  the  database  accesses 

that  are  required: 

1)  Get  the  first  symptom  from  the  physician's  list. 

2)  Do  an  indexed  retrieval  to  the  first  occurrence  of  that  SYMPTOM 
record  in  the  database. 

3)  Go  up  to  the  SYNDROME  record  to  find  the  number  of  the  birth 
defect  containing  that  symptom. 

4)  Get  the  next  occurrence  of  that  SYMPTOM  record.  (The  same 
symptom  may  be  present  in  multiple  birth  defects.)  If  more 
exist,  return  to  Step  3. 

5)  Repeat  Steps  2  to  4  for  every  symptom  in  the  physician's  list. 

6)  For  each  of  the  syndromes  that  has  been  retrieved  (after 
eliminating  duplicates),  get  all  of  the  associated  SET  and 
SYMPTOM  records. 

The  last  step  is  required  for  the  following  three  reasons: 

1)  It  is  not  known  if  all  of  the  required  symptoms  for  a  set  have 
been  matched,  until  all  of  the  SYMPTOM  records  in  that  set  have 
been  retrieved. 

2)  The  total  possible  score  (sure  of  the  set  weights)  for  a  birth 
defect  cannot  be  computed,  until  all  of  the  SET  records  for  a 
birth  defect  are  retrieved. 

3)  Though  the  SYMPTOM  records  for  sets  which  have  not  been 
selected  are  not  immediately  needed,  it  is  more  efficient  to 
retrieve  all  of  the  SET  and  SYMPTOM  records  in  one  pass,  than 
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to  selectively  retrieve  records.  This  is  due  to  a  facility  in 
FOCUS,  whereby  multiple  records  can  be  retrieved  with  a  single 
database  call.  (The  effects  of  this  facility  on  database 
access  costs  will  be  explained  in  a  later  section.) 

Thus,  in  order  to  score  the  birth  defects,  the  program  had  to  go  up 
the  database  for  each  of  the  occurrences  of  the  selected  symptoms.  In 
addition,  for  each  of  the  potential  SYNDROMES,  it  was  necessary  to 
retrieve  all  of  the  associated  SET  and  SYMPTOM  records. 
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Restructuring  the  BDIS  Database 

To  improve  efficiency,  the  database  was  restructured  with  the 
following  changes. 

1)  The  number  of  required  symptoms  in  a  set  (REQUIRED  COUNT)  was 
inserted  into  the  SET  record.  This  eliminated  the  need  to 
retrieve  all  of  the  SYMPTOM  records  in  a  set  just  to  determine 
if  the  required  symptoms  had  been  matched. 

2)  The  total  weight  of  the  sets  (TOTAL  WEIGHT)  was  inserted  into 
the  SYNDROME  record.  Now  it  was  no  longer  necessary  to 
retrieve  all  of  the  SET  records  to  determine  the  maximum 
possible  score. 

3)  The  SET  records  were  incorporated  into  the  SYMPTOM  records.  This 

meant  that  the  path  up  to  the  SYNDROME  record  for  an  indexed 
symptom  was  shorter,  and  that  the  total  number  of  records  under 
a  root  SYNDROME  record  was  less. 

The  first  two  changes  are  examples  of  "family  redundancy"  techniques 
that  result  in  the  database  depicted  in  Figure  7A.  The  last  change  is  a 
collapsing  of  levels,  which  gives  the  database  of  Figure  7B. 

With  the  database  of  Figure  7B,  a  score  for  each  birth  defect  can  be 
canputed  by  simply  goire?  up  the  database  for  each  occurrence  of  the 
selected  symptoms.  We  can  then  select  only  the  top  candidates  and 
retrieve  all  of  their  symptoms  for  use  in  the  CHALLENGE  PHASE.  As  a 
result,  the  following  reductions  in  the  number  of  required  accesses  have 
been  made: 
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1)  The  number  of  accesses  required  to  go  up  to  a  root  SYNDROME 
record  is  two  instead  of  three. 

2)  The  number  of  trips  down  the  hierarchy  is  limited  to  ton  (the 
top  birth  defects) .  Formerly,  every  considered  birth  defect 
required  a  downward  trip. 

3)  Less  records  are  retrieved  on  each  of  these  downward  trips,  as 
the  SET  records  have  been  collapsed  into  the  SYMPTOM  records. 

Thus,  compared  to  the  original  syston,  a  significant  reduction  in  the 
number  of  accesses  has  been  made. 
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Figure  7A:  BDIS  database  after  Family  Redundancy 
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Figure  7B:  BDIS  database  after  Collapsing  of  Levels 
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Quantification  of  Savings  Potential  for  BDIS 

Using  Equation  4,  and  subtracting  a  further  manpower  and  computer 

cost  for  changing  the  system  (K) ,  the  following  savings  equation  is 

obtained: 

Savings(t)  =  (  iRiNi)  -  (b*B)  -  (lUjVj)  -  K  (6) 

i  ' 

For  this  case: 

b*B  is  a  minimal  cost.  The  increase  in  storage  was 

under  250K  bytes. 

{  T.  UjVj)      is  small  because  the  data  base  is  updated  only  a 
J 

few  times  per  year.  (In  test  runs,  there  was 
no  significant  difference  in  the  cost  of 
updating  the  old  and  new  databases.) 

i  is  1,  i.e.,   only  one  type  of  query  (diagnostic 

assist)  exists. 

K  is  small  when  averaged  over  several  years  of  use, 

and  in  this  particular  case,  the  restructuring 
coincided  with  a  major  overhaul  of  the  system, 
and  thus,  did  not  amount  to  an  additional 
expense. 
Therefore,  equation  (6)  reduces  to: 

Savings  (t)  =  R  *  N  (7) 

R,  which  is  the  difference  in  cost  of  a  query,  is  exceedingly 
difficult  to  conpute  theoretically.  Firstly,  FOCUS  maintains  its  data  in 
4000  byte  pages,  and  therefore,  an  access  may  or  may  not  imply  an  actual 
disk  I/O.   Secondly,   the  cost  of  retrieving  information  from  the 
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database  consists  of  a  relatively  fixed  charge  of  making  a  database 
call,  plus  a  variable  cost  depending  on  the  number  of  records  accessed. 
Lastly,  the  single  call/multiple  record  facility  of  FOCUS  means  that 
many  records  can  be  retrieved  at  a  smaller  cost  than  if  they  were 
retrieved  independently.  Despite  all  of  these  problems,  and  for  lack  of 
a  better  choice,  the  theoretical  estimate  of  the  number  of  accesses  made 
was  used  to  attenpt  to  predict  system  costs. 

Let  us  therefore,  compute  Ao,  the  number  of  accesses  under  the  old 
systan,  and  Ar,  the  number  of  accesses  under  the  restructured  system. 

From  the  database  it  was  determined  that  each  syndrome  contains  on 
the  average  three  sets  and  19  symptoms.  In  addition,  the  mean  number  of 
syndranes  in  which  a  particular  symptom  exists  is  fifteen.  152  test 
cases  were  examined,  and  each  contained  an  average  of  eleven  symptoms, 
and  retrieved  a  mean  of  210  candidate  birth  defects. 

With  this  information,  Ao  can  be  computed  as  follows: 
Ao  =  (Accesses  up)  +  (Accesses  down) 

Accesses  up   =  #Symptoms/Case  *  #Syndromes/Symptom  *  taccesses  to  go  up 

=  495  (11  *  15  *  3) 

Accesses  down  =  ICarri idates  *  #accesses  to  get  all  symptoms 

=  4830  210  *  (1  +  3  +  19) 

Therefore,  under  the  original  system,  the  number  of  accesses  required 
(Ao)  is  5325. 
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Ar  can  be  computed  in  a  similar  manner  with  the  following  important 
exceptions: 

1)  The  number  of  accesses  required  to  go  up  to  a  root  SYNDROME 
record  is  two  instead  of  three. 

2)  The  number  of  trips  down  the  hierarchy  is  limited  to  the  top 
birth  defects.  (Due  to  scoring  ties,  the  average  number  of 
trips  is  twelve  rather  than  ten.) 

3)  The  number  of  accesses  required  to  get  all  of  the  symptoms  for 
a  birth  defect  is  20  instead  of  23. 

Accesses  up  =  330  (11  *  15  *  2) 

Accesses  down  =  240         (12  *  20) 

The  total  number  of  accesses  under  the  new  system  (Ar)  is  570.  This 
Implies  that  the  new  system  will  use  about  one-ninth  the  number  of 
accesses  as  the  original  systen. 
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Experimental  Results 

In  this  section,  the  observed  results  when  the  system  was  implemented 
are  reported.  The  experiment  compared  I/O  and  CPU  usage  under  the 
original  and  restructured  systans  for  ten  test  cases.  The  cases  had 
fron  one  to  twenty-one  symptoms. 

The  experimental  data  is  summarized  in  Figures  8  and  9.  In  Figure  8, 
the  nun±)er  of  I/O  requests  is  illustrated  and  in  Figure  9,  the  CPU  usage 
is  shown.  The  data  was  obtained  by  monitoring  the  resource  usage  for 
the  retrieval  phase  of  the  program  and  subtracting  the  usage  for  basic 
operations,  i.e.,  opening  databases,  loading  modules,  etc.  Note  that 
since  a  FOCUS  access  does  not  imply  a  disk  I/O,  then  the  absolute  values 
of  the  experimental  results  cannot  be  compared  directly  with  the 
formulated  estimates.  However,  the  following  observations  can  be  made: 

1)  As  expected,  the  I/O  usage  under  the  restructured  system  is 
relatively  constant.  This  is  because  no  matter  how  many 
symptoms  are  entered,  the  program  only  goes  down  the  tree  for 
the  top  candidate  birth  defects.  On  the  other  hand,  the  old 
system  went  down  the  hierarchy  for  every  candidate  birth 
defect,  and  therefore,  as  more  symptoms  imply  more  candidates, 
its  resource  usage  varies  closely  with  the  number  of  symptoms. 

2)  The  number  of  I/O's  under  the  restructured  system  compared  to 
the  original  systen,  ranges  from  one-sixth  (with  four  symptoms) 
to  one  twenty-eighth  (with  seventeen  symptoms) .  (The  one 
symptom  case  only  retrieved  five  candidate  birth  defects,  and 
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therefore,  both  systems  had  similar  usage.) 

Using  the  regression  line  statistics  and  the  case  of  eleven  symptoms, 
the  new  system  saves  sixteen  CPU  seconds  and  1500  I/O  operations.  Using 
a  CPU  rate  of  $30  per  minute  and  an  I/O  rate  of  $2  per  thousand,  this 
amounts  to  a  savings  of  ?11  per  diagnosis.  With  100  users,  each  doing 
five  cases  per  month,  the  savings  could  reach  $5500  per  month. 
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Other  Issues  to  be  Considered 

Following  are  some  other  points  of  interest  that  have  been  touched 
upon  in  this  paper,  but  which  are  beyond  the  scope  of  this  analysis. 
Managerial  Implications: 

Often,  the  costs  of  the  query  and  the  costs  of  maintenance  are 
borne  by  two  different  organizations.  Therefore,  these  costs 
may  be  weighed  differently  by  a  syston  implementor. 
Technical  Issues: 

An  interesting  question  concerns  the  independence  of  multiple 
restructurings  of  the  same  database,  e.g.,  are  the  advantages 
obtained  by  collapsing  a  level  independent  of  other 
improvements  that  are  being  made  elsewhere. 
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Conclusion 

The  two  concepts  introduced  here,  family  redundancy  and  collapsing  of 
levels,  can  under  certain  circumstances,  lead  to  dramatic  savings.  The 
conditions  for  a  successful  database  restructuring  include: 

1)  A  high  ratio  of  database  accesses  to  updates. 

2)  A  well-defined  set  of  database  operations. 

3)  A  small  number  of  data  fields  that  must  be  made  redundant  in 
order  to  obtain  a  high  reduction  in  accesses. 

4)  A  systen  that  can  be  restructured  cheaply  and  efficiently. 
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